fix(loadgen): h1 read-EOF backoff + h2 frame-size/flow-control + h2 reconnect (full-matrix resilience)#63

Merged
FumingPower3925 merged 3 commits into main from fix/h2-flowctl-and-h1-churn-backoff on Jun 21, 2026

Conversation

FumingPower3925 (Contributor) commented Jun 21, 2026

Three resilience fixes so the full benchmark matrix survives servers it previously DNF'd on. Each is the loadgen side of a column that produced zero requests in the last full run.

1. h1client: backoff-paced reconnect on read EOF

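The write-error path already reconnected and backed off, but the read-status and read-header EOF paths returned the bare error. A Connection: close server (or one closing mid-response under churn-close) surfaces as a read EOF, so a close-after-one-response server spun on read EOFs with no pacing and never recovered. drogon collapsed to 0 successful requests in the churn-close cell from exactly this. The read paths now mirror the write path: reconnect for the next request, and record a connect error and back off only when the server is genuinely down; otherwise reset the backoff.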

2. h2client: honor server MAX_FRAME_SIZE + send-window flow control

POST bodies > 16384 B were sent as a single oversized DATA frame (FRAME_SIZE_ERROR), and bodies larger than the 65535-byte initial send window overran flow control. This is the post-64k-h2 failure, which hit 5 h2 columns: aspnet-h2, axum-h2, elysia-h2, hono-h2, hyper-h2. The client now captures the server's SETTINGS_MAX_FRAME_SIZE and SETTINGS_INITIAL_WINDOW_SIZE at handshake, splits the body at the frame size, and paces it against the connection and per-stream send windows, which are replenished from WINDOW_UPDATE. The regression test posts 200000 B through a strict 16384/65535 h2c server; the pre-fix single-frame send fails it with GOAWAY code=6.
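The split-and-pace rule can be sketched as follows. This is an illustration under assumptions, not the h2client implementation: `sendState`, `nextChunk`, and `planFrames` are hypothetical names, and the `refill` callback stands in for WINDOW_UPDATE frames arriving from the peer.

```go
package main

import "fmt"

// sendState holds the peer limits a sender must respect. The type and field
// names are hypothetical; only the HTTP/2 rules they model are real.
type sendState struct {
	maxFrame     int // peer's SETTINGS_MAX_FRAME_SIZE
	connWindow   int // connection-level send window
	streamWindow int // send window of the stream being written
}

// nextChunk returns how many body bytes fit in the next DATA frame.
// Zero means the sender must wait for a WINDOW_UPDATE.
func (s *sendState) nextChunk(remaining int) int {
	n := remaining
	for _, limit := range []int{s.maxFrame, s.connWindow, s.streamWindow} {
		if limit < n {
			n = limit
		}
	}
	if n < 0 {
		n = 0
	}
	return n
}

// planFrames returns the DATA frame sizes for a body of bodyLen bytes.
// refill must add window credit; a real writer would instead block until
// the read loop observes a WINDOW_UPDATE.
func planFrames(bodyLen int, s *sendState, refill func(*sendState)) []int {
	var frames []int
	for remaining := bodyLen; remaining > 0; {
		n := s.nextChunk(remaining)
		if n == 0 {
			refill(s)
			continue
		}
		// Every DATA byte debits both the connection and the stream window.
		s.connWindow -= n
		s.streamWindow -= n
		remaining -= n
		frames = append(frames, n)
	}
	return frames
}

func main() {
	s := &sendState{maxFrame: 16384, connWindow: 65535, streamWindow: 65535}
	stalls := 0
	frames := planFrames(65536, s, func(s *sendState) {
		stalls++
		s.connWindow += 65535
		s.streamWindow += 65535
	})
	fmt.Println(frames, "stalls:", stalls)
}
```

For a 65536 B body against 16384/65535 limits, this plans three full frames, one 16383 B frame that drains the window, and then a final 1 B frame after a single stall. That last byte is why a 64 KiB POST is exactly the size at which the failure appears.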

3. h2client: re-dial connections the server closes/GOAWAYs mid-cell

The h2 client dialed once in New() and never recovered a torn-down conn. A server that GOAWAYs or closes connections periodically left the slot dead, and every DoRequest hot-looped on the closed-conn error (the h2 analog of fix 1 above). hypercorn does this, so fastapi-h2 hit it every cell: ~1.1 billion errors / 0 requests per 35 s cell. Each conn is now an h2ConnSlot (atomic.Pointer + single-flight redial + backoff); DoRequest re-dials a dead slot and swaps the fresh conn in atomically, so sibling workers pick it up lock-free.

Verification

  • Full suite green on x/net 0.56.0, including under -race.
  • New regression tests for the flow-control split and the reconnect path.
  • Live hypercorn h2c repro (the fastapi-h2 server): GET 0 → 47.6k req (0.4% err), POST-65536-body 0 → 16k req — both previously zero.

A Connection:close server (or one closing mid-response under churn-close)
surfaces as a read EOF on the status line or a header line, not just on the
write. The write-error path already reconnects + backs off; the two read
paths returned the bare error, so a close-after-one-response server spun
read-EOFs with no pacing and never re-established a usable conn. drogon
collapsed to 0 successful requests from exactly this. Mirror the write path:
reconnect for the next request, recordConnectError + backoff only when the
server is genuinely down, otherwise reset the backoff.

POST bodies larger than 16384 B were sent as a single oversized DATA frame
(FRAME_SIZE_ERROR against a 16384-default server) and bodies larger than the
65535 initial send window overran flow control (FLOW_CONTROL_ERROR / hang to
the 5-min deadline). This is the post-64k-h2 failure (64 KiB body = 65536 B,
one past the window). Capture the server's SETTINGS_MAX_FRAME_SIZE and
SETTINGS_INITIAL_WINDOW_SIZE at handshake; split the body at the server's
frame size and pace it against the connection (RFC 7540 §6.9.2: starts at 65535)
and per-stream send windows, replenished from WINDOW_UPDATE in readLoop. The
writer goroutine is sequential, so the active stream's window is tracked by
curStreamID/curStreamWindow. Drop the now-dead h2WriteReq.maxFrame field.

Regression test posts a 200000-B body through a strict h2c server advertising
16384/65535 (matching real bench targets, not x/net's lenient 1 MiB
defaults); the pre-fix single-frame send fails it with GOAWAY code=6.

FumingPower3925 force-pushed the fix/h2-flowctl-and-h1-churn-backoff branch from c30c81e to a9fb908 on June 21, 2026 10:11

The h2 client dialed its connections once in New() and never recovered one
the server tore down. A server that GOAWAYs or closes connections
periodically — hypercorn does, so the fastapi-h2 column hit it on every
cell — left the slot permanently dead: readLoop marks the conn closed on
GOAWAY and returns, then every DoRequest returns the bare closed-conn error
with no pacing. fastapi-h2 logged ~1.1 billion errors / 0 successful
requests per 35 s cell from exactly this hot loop (the h2 analog of the h1
churn-close bug).

Wrap each connection in an h2ConnSlot (atomic.Pointer to the live conn +
single-flight redial mutex + connectBackoff). DoRequest re-dials a dead slot,
paced by the slot backoff, swapping the fresh conn in atomically so sibling
workers pick it up lock-free. Verified against a live hypercorn h2c server:
GET 0 -> 47.6k req (0.4% error), POST-65536-body 0 -> 16k req — both
previously zero. Regression test closes a live conn out from under the
client and asserts the next requests recover against the still-running
server.

FumingPower3925 changed the title from "fix(loadgen): h1 churn-backoff on read EOF + h2 frame-size/flow-control (post-64k-h2)" to "fix(loadgen): h1 read-EOF backoff + h2 frame-size/flow-control + h2 reconnect (full-matrix resilience)" on Jun 21, 2026
FumingPower3925 merged commit 7328187 into main on Jun 21, 2026
3 checks passed
FumingPower3925 deleted the fix/h2-flowctl-and-h1-churn-backoff branch on June 21, 2026 15:43